Topic Analysis Using a Finite Mixture Model
Abstract
We address the issue of 'topic analysis,' which determines a text's topic structure: what topics the text includes and how those topics change within it. We propose a novel approach to this issue, one based on statistical modeling and learning. We represent topics by means of word clusters, and employ a finite mixture model to represent a word distribution within a text. Our experimental results indicate that our method significantly outperforms a method that combines existing techniques.

1 Introduction

We consider here the issue of 'topic analysis,' which determines a text's topic structure: what topics the text includes and how those topics change within it. Topic analysis consists of two main tasks: topic identification and text segmentation (based on topic changes).

Topic analysis is extremely useful in a variety of text processing applications. For example, it can be used in the automatic indexing of texts for purposes of information retrieval. With it, one can understand what the main topics and subtopics of a text are, and where those subtopics lie within the text.

To the best of our knowledge, however, no previous study has dealt with the topic analysis problem in the above sense. The most closely related tasks are keyword extraction and text segmentation. A keyword extraction method (e.g., one using tf-idf (Salton and Yang, 1973)) generally extracts from a text keywords that represent topics within the text, but it does not conduct segmentation. A segmentation method (e.g., TextTiling (Hearst, 1997)) generally segments a text into blocks (paragraphs) in accord with topic changes within the text, but it does not by itself identify (or label) the topics discussed in each of the blocks.

The purpose of this paper is to provide a single framework for conducting topic analysis, i.e., performing both topic identification and text segmentation.
The key characteristics of our framework are 1) representing a topic by means of a cluster of words that are closely related to the topic, and 2) employing a stochastic model, called a finite mixture model (e.g., (Everitt and Hand, 1981)), to represent a word distribution within a text.

The finite mixture model has a hierarchical structure of probability distributions. The first level is a probability distribution of topics (topic distribution). The second level consists of probability distributions of words included within topics (word distributions). These word distributions are linearly combined to represent a word distribution within a text, with the topic distribution being used as the coefficient vector. Hereafter we refer to a finite mixture model having this structure as a stochastic topic model (STM).

Before conducting topic analysis, we create word clusters (topics) on the basis of word co-occurrence in corpus data. We have developed a new method for word clustering using stochastic complexity (or the MDL principle) (Rissanen, 1996).

In topic analysis, we estimate a sequence of STMs that would have given rise to a given text, assuming that each block of a text is generated by an individual STM. We perform text segmentation by detecting significant differences between STMs and perform topic identification by means of estimation of STMs. With the results, we obtain the text's topic structure, which consists of segmented blocks and their topics.

It is possible to perform topic analysis by combining an existing word extraction method (e.g., tf-idf) and an existing text seg-
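The two-level structure of the stochastic topic model described above can be illustrated with a minimal Python sketch. It is written under several assumptions and is not the authors' implementation: the cluster names, word probabilities, and example block are invented for illustration, the word distributions are held fixed, and the topic distribution is estimated with a standard EM update for mixture weights, which the text here does not specify.

```python
from collections import Counter

def stm_word_prob(word, topic_dist, word_dists):
    """Mixture probability P(w) = sum over topics k of P(k) * P(w | k)."""
    return sum(p_k * word_dists[k].get(word, 0.0) for k, p_k in topic_dist.items())

def estimate_topic_dist(words, word_dists, iterations=50):
    """Estimate the topic distribution of one text block by EM.

    Assumption: the word distributions P(w | k) are given and fixed;
    only the mixture weights P(k) are re-estimated.
    """
    topics = list(word_dists)
    counts = Counter(words)
    theta = {k: 1.0 / len(topics) for k in topics}  # uniform starting point
    for _ in range(iterations):
        expected = {k: 0.0 for k in topics}
        for w, c in counts.items():
            denom = sum(theta[k] * word_dists[k].get(w, 0.0) for k in topics)
            if denom == 0.0:
                continue  # word belongs to no cluster; it carries no information
            for k in topics:
                # E-step: share of this word's count attributed to topic k
                expected[k] += c * theta[k] * word_dists[k].get(w, 0.0) / denom
        total = sum(expected.values())
        if total == 0.0:
            break  # no word in the block is covered by any cluster
        # M-step: normalized expected counts become the new weights
        theta = {k: expected[k] / total for k in topics}
    return theta

# Hypothetical word clusters (topics) with their word distributions.
word_dists = {
    "trade": {"export": 0.5, "tariff": 0.5},
    "sports": {"game": 0.5, "team": 0.5},
}
block = ["export", "tariff", "export", "game"]
theta = estimate_topic_dist(block, word_dists)
p_export = stm_word_prob("export", theta, word_dists)
```

In this toy setup the two clusters share no words, so the estimated topic distribution simply reflects how many words of the block fall into each cluster. With overlapping clusters, the E-step would split each word's count across the topics that can generate it.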
Journal: Inf. Process. Manage.
Volume: 39
Publication date: 2000